OVERVIEW

This readme file describes the annotation of several sets of TREC queries with named entities listed in Freebase. We hope this dataset will be particularly useful in conjunction with the forthcoming release of similarly annotated ClueWeb corpora (2009 and 2012 versions).

The annotation process was automatic, and hence imperfect. However, the annotations are of generally high quality, as we strove for high precision (and, by necessity, lower recall). For each entity we recognize with high confidence, we provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and the confidence score.

Sample input (from TREC topic 1, year 2009):
Find information on President Barack Obama's family history, including genealogy, national origins, places and dates of birth, etc.

Sample output (the fields are tab-separated):
Barack Obama<tab>30<tab>41<tab>/m/02mjmr<tab>0.99830526

In this example,
- "Barack Obama" is the entity mention recognized in the input text.
- 30 and 41 are the beginning and end byte offsets of the entity mention in the input text.
- /m/02mjmr - Freebase identifier for the entity. To look up the entity in Freebase, just prepend the string "http://www.freebase.com" to the identifier, like so: "http://www.freebase.com/m/02mjmr".
- 0.99830526 - confidence score of recognizing this particular entity.
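
As a rough illustration of this format, the following Python sketch parses one annotation line and checks its offsets against the input text. The names Annotation, parse_annotation and freebase_url are hypothetical and are not part of this release. The sketch assumes that offsets are 0-based and that the end offset is inclusive, which is what the sample above suggests but is not stated explicitly.

```python
from typing import NamedTuple

FREEBASE_PREFIX = "http://www.freebase.com"


class Annotation(NamedTuple):
    mention: str
    begin: int         # byte offset of the first byte of the mention
    end: int           # byte offset of the last byte (assumed inclusive)
    mid: str           # Freebase identifier, e.g. /m/02mjmr
    confidence: float


def parse_annotation(line: str) -> Annotation:
    """Split one tab-separated annotation line into typed fields."""
    mention, begin, end, mid, confidence = line.rstrip("\r\n").split("\t")
    return Annotation(mention, int(begin), int(end), mid, float(confidence))


def freebase_url(mid: str) -> str:
    """Prepend the Freebase host name to a mid."""
    return FREEBASE_PREFIX + mid


text = ("Find information on President Barack Obama's family history, "
        "including genealogy, national origins, places and dates of birth, etc.")
ann = parse_annotation("Barack Obama\t30\t41\t/m/02mjmr\t0.99830526")

# Offsets count bytes, so slice the encoded text rather than the string.
# Assumption: 0-based offsets with an inclusive end, hence the "+ 1".
span = text.encode("utf-8")[ann.begin:ann.end + 1].decode("utf-8")
print(span == ann.mention)
print(freebase_url(ann.mid))
```

Slicing the UTF-8 bytes instead of the decoded string keeps the check valid for input that contains non-ASCII characters.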

Some queries or parts thereof do not have any annotations (and are not included in the annotated files) because no Freebase entity was recognized in them with high confidence.


DATA DESCRIPTION

1)  TREC Web topics (ClueWeb09, 2009--2012)

The following 4 sets of topics were annotated:
a) 2009: topics 1-50
http://trec.nist.gov/data/web/09/wt09.topics.full.xml
b) 2010: topics 51-100
http://trec.nist.gov/data/web/10/wt2010-topics.xml
c) 2011: topics 101-150
http://trec.nist.gov/data/web/11/full-topics.xml
d) 2012: topics 151-200
http://trec.nist.gov/data/web/12/full-topics.xml

These files are stored in the directory "TREC Web topics/Original topic files/" (we preserved the original file names but prepended the year at the beginning, like so: 2009_wt09.topics.full.xml).

Each topic contains several parts, including query, description, and a number of subtopics (numbered starting from 1). We annotated all parts of each topic except for 'query', which is usually very short and missing case information.

To perform annotation, we first used the script "TREC Web topics/preprocessing_script.pl" to extract the description and the subtopic fields from the original XML topic files. The results of this preprocessing are available in the directory "TREC Web topics/Preprocessed topic files/".
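
For readers who want to reproduce a similar extraction without the Perl script, the following Python sketch shows one possible approach. It is not the script used for this release. The element names (topic, query, description, subtopic), the "number" attribute, the made-up sample input, and the "topic-N-subtopic-M" label format are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature topic file. The element and attribute names
# are assumptions, and the text content is invented for this sketch.
SAMPLE_XML = """\
<topics>
  <topic number="1">
    <query>example query</query>
    <description>Example description text.</description>
    <subtopic number="1">First example subtopic.</subtopic>
    <subtopic number="2">Second example subtopic.</subtopic>
  </topic>
</topics>
"""


def extract_parts(xml_text):
    """Yield (label, text) for each description and subtopic; skip 'query'."""
    root = ET.fromstring(xml_text)
    for topic in root.iter("topic"):
        number = topic.get("number")
        description = topic.find("description")
        if description is not None and description.text:
            yield f"topic-{number}-description", description.text.strip()
        for subtopic in topic.iter("subtopic"):
            # The subtopic label format is an assumption of this sketch.
            label = f"topic-{number}-subtopic-{subtopic.get('number')}"
            yield label, (subtopic.text or "").strip()


parts = list(extract_parts(SAMPLE_XML))
for label, text in parts:
    print(label, text, sep="\t")
```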

The annotation results (for all the topics combined) are available in the file "TREC Web topics/Annotated TREC Web topics 2009-2012.txt", and use the annotation format described above. The annotations come in sets, each grouping together all the entities identified in a given topic/subtopic. For instance, the sample annotation above is preceded by the following line:
topic-1-description,
which means that the entity "Barack Obama" was identified in the "description" part of topic 1.
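
The grouped layout can be read with a short Python sketch such as the one below. The function name group_annotations is hypothetical. The sketch assumes that every annotation row has exactly five tab-separated fields and treats any other non-empty line as a group header.

```python
from collections import defaultdict


def group_annotations(lines):
    """Map each group header to the annotation rows that follow it."""
    groups = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines between groups
        fields = line.split("\t")
        if len(fields) == 5:
            # mention, begin offset, end offset, mid, confidence
            if current is None:
                raise ValueError("annotation row before any group header")
            groups[current].append(fields)
        else:
            # Assumption: anything that is not a 5-field row is a header.
            current = line.strip()
    return dict(groups)


sample = [
    "topic-1-description\n",
    "Barack Obama\t30\t41\t/m/02mjmr\t0.99830526\n",
]
groups = group_annotations(sample)
print(groups["topic-1-description"][0][3])
```

To process the real file, pass an open file handle (opened with encoding="utf-8") instead of the sample list.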

Since the number of annotations in these topics is fairly small, they have been reviewed by human annotators, and the errors they found were manually corrected. We estimate the number of remaining errors to be under 1%.


2)  2009 Million Query Track queries

The original dataset is described here: http://trec.nist.gov/data/million.query09.html and contains 40,000 queries.

The original query file is available as "Million query track/09.mq.topics.20001-60000". It was preprocessed using the script "Million query track/preprocessing_script.pl" to extract the query text, and the preprocessing result (which was the input to annotation) is stored in the file "Million query track/09.mq.topics.20001-60000.queries".

The annotation results are available in the file "Million query track/Million query track 2009.txt" in the above format.

In this dataset, there is only one part in each query, and each group of annotations is preceded by the query id, like so:
query-20001-1,
which means that the following annotations belong to query number 20001.
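
A header of this form can be mapped back to its query number with a small Python sketch like the following. The name query_number is hypothetical. The meaning of the trailing number in the header is not described here, so the sketch matches it but otherwise ignores it.

```python
import re

# Matches headers such as "query-20001-1".
HEADER_RE = re.compile(r"^query-(\d+)-(\d+)$")


def query_number(header):
    """Return the query number from a group header, or None if it does not match."""
    match = HEADER_RE.match(header.strip())
    return int(match.group(1)) if match else None


print(query_number("query-20001-1"))
```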

With 40,000 queries, this dataset is much larger than the TREC Web topics. It was not possible to verify all the automatic annotations manually, but based on a small sample we believe the error rate to be under 3%. To keep the data consistent, the errors identified in this small sample were not corrected.


CITATION

If you use this data in a publication, please cite it as
Evgeniy Gabrilovich and Amarnag Subramanya, "TFQ1: Freebase annotation of TREC queries, Version 1 (Release date 2013-05-30, Format version 1, Correction level 0)", May 2013.
Please also include in the citation the following URL where the data is available: http://lemurproject.org/clueweb09/related-data.php


ACKNOWLEDGMENTS

This data set was prepared by Evgeniy Gabrilovich and Amarnag Subramanya (Google).

Thanks to John Giannandrea, Jesse Saba Kirchner, Jeremy O'Brien, Dave Orr, Fernando Pereira, Michael Ringgaard, and Dave Price for making this release possible.
Thanks also to Charlie Clarke, Jaap Kamps, Don Metzler, and Ian Soboroff for comments.
Special thanks to Jamie Callan for helpful suggestions and for hosting the annotated data at CMU.
